Preface

Project Goal: This project is threefold: (1) transformation and cleaning of the data; (2) exploratory data analysis of the dating-site data; (3) machine learning modelling.

Create Themes and Load Data:

# Clear environment
rm(list = ls())
options(scipen = 999)

# Load packages in use
pacman::p_load(dplyr, ggplot2, tidyverse, rsample, caret, glmnet, vip, pdp, stringr, 
               tidytext, emoji, stopwords, ggridges, wordcloud2, ggmap, readxl, maps,
               viridis, eurostat, corrplot, GGally, reshape2, grid, rpart, rpart.plot, 
               randomForest)

# Define plot themes and palettes
palette <- c("#1beaa7", "#00d9d3", "#00c2ff", "#00a5ff", "#007bff", "#8c2aef")

th <- theme(
  # Background and grid
  panel.background = element_blank(),
  plot.background = element_rect(fill = "#000123", color = "#000123"),
  panel.grid.major = element_line(color = "white", linewidth = 0.1),
  panel.grid.minor = element_line(color = "white", linewidth = 0.1),
  
  # Axis titles and labels
  axis.title.x = element_text(colour = "white", size = 12, family = "arial", vjust = -2, hjust = 0.5, face = "bold"),
  axis.title.y = element_text(colour = "white", size = 12, family = "arial", vjust = 3, hjust = 0.5, face = "bold"),
  axis.text.y = element_text(colour = "white", size = 10, family = "arial"),
  axis.text.x = element_text(colour = "white", size = 10, family = "arial"),
  
  # Margins and spacing
  plot.margin = unit(c(0.5, 0.5, 0.5, 0.5), "cm"),
  
  # Title, subtitle, and caption
  plot.title = element_text(colour = "white", size = 16, family = "arial", hjust = 0.5, face = "bold"),
  plot.subtitle = element_text(colour = "white", size = 14, family = "arial"),
  plot.caption = element_text(colour = "white", size = 10, family = "arial"),
  
  # Legend
  legend.position = "right",
  legend.text = element_text(colour = "white", size = 12, family = "arial"),
  legend.title = element_text(colour = "white", size = 12, family = "arial", hjust = 3, face = "bold"),
  legend.key = element_rect(fill = "#000123", color = "#000123"),
  legend.background = element_rect(fill = "#000123"),
  
  # Other
  axis.ticks = element_blank(),
  strip.text = element_text(colour = "white", size = 12, family = "arial", vjust = 1, hjust = 0.5)
)


# Import data
path <- "/Users/janhendrikpretorius/Library/CloudStorage/OneDrive-StellenboschUniversity/Masters-2023/Modules/Data Science/DataScience-871-repo/JHPretorius-Project/Candidate Data Sets/Dating/"
file <- "lovoo_v3_users_api-results.csv"

df <- read_csv(paste0(path, file))

Introduction

The objective of this study is to investigate the critical factors that contribute to an individual’s appeal, popularity, and recognition within an online dating platform. The data utilised for this research is sourced from Lovoo, a prominent European dating application, and is accessible via Kaggle.

The underlying motivation for this study stems from the desire to comprehend behavioural patterns that transcend the confines of physical attractiveness. The aim is to unveil hidden determinants that may shape interpersonal interactions within a digital dating platform. The behaviour exhibited on these platforms carries significance, even in economic contexts. By deciphering this behavioural paradigm, it can potentially contribute to the development of economic models. These enhanced models can subsequently offer a more profound analytic framework to elucidate overall mate-selection behaviour.

The initial phase of the analysis involves the transformation of raw data into a more interpretable format. This includes the creation of additional variables tailored to augment the predictive capacity of the statistical models employed in subsequent stages. This phase facilitates the exploratory aspect of the research, enabling an in-depth examination of data in search of potential predictor variables. The objective extends beyond understanding the phenomena; the aim is to anticipate which factors instigate an increased number of profile views and, subsequently, the ‘likes’ received.

The modelling process is a two-step approach. The first stage focuses on identifying variables that may elucidate why individuals view a certain profile. Potential variables include online presence, age, geographical location, and the timing of an individual’s online activity. The second stage aims to identify factors that influence the likelihood of a profile receiving ‘likes’. These may include the number of pictures on a profile, the characteristics of a profile’s biography, languages spoken, profile verification status, and mobile usage.

A decision tree model and a random forest model are employed in this study, owing to their robust ability to discern intricate characteristics that influence outcomes. The combination provides a comprehensive analysis approach: it pairs the simplicity and interpretability of decision trees with the robustness and increased accuracy of random forests to predict the distinct measures of user engagement on the dating platform. By training two models for each method, I ensure a more detailed understanding of the factors driving both profile views and likes.

Part 1: Transformation and Cleaning

The dataset under consideration comprises 3 973 observations and roughly 40 variables, each encapsulating specific attributes pertaining to individual profiles and related demographic information. An excerpt of the dataset is provided below, supplemented by Table 1, which describes a selection of significant variables. It is worth noting that the dataset solely encompasses individuals identifying as female. As such, the core objective of this analysis is to discern the determinants influencing the behavioural patterns of individuals displaying interest in females.

Head of dataframe

## # A tibble: 6 × 42
##   gender genderLooking   age name         counts_details counts_pictures
##   <lgl>  <chr>         <dbl> <chr>                 <dbl>           <dbl>
## 1 FALSE  M                25 daeni                  1                  4
## 2 FALSE  M                22 italiana 92            0.85               5
## 3 FALSE  M                21 Lauraaa                0                  4
## 4 FALSE  none             20 Qqkwmdowlo             0.12               3
## 5 FALSE  M                21 schaessie {3           0.15              12
## 6 FALSE  M                24 Baby dee               0.81              18
## # ℹ 36 more variables: counts_profileVisits <dbl>, counts_kisses <dbl>,
## #   counts_fans <dbl>, counts_g <dbl>, flirtInterests_chat <lgl>,
## #   flirtInterests_friends <lgl>, flirtInterests_date <lgl>, country <chr>,
## #   city <chr>, location <chr>, distance <dbl>, isFlirtstar <dbl>,
## #   isHighlighted <dbl>, isInfluencer <dbl>, isMobile <dbl>, isNew <dbl>,
## #   isOnline <dbl>, isVip <dbl>, lang_count <dbl>, lang_fr <lgl>,
## #   lang_en <lgl>, lang_de <lgl>, lang_it <lgl>, lang_es <lgl>, …

Table 1: Description of variables in data set.

genderLooking: Preferred gender the subject is looking to engage with, represented as 'M' (male), 'F' (female), 'both' (male and female), or 'none'.
age: Age of the individual.
counts_details: How complete the profile is, i.e. the proportion of detail in the account, measured from 0.0 to 1.0.
counts_pictures: Number of pictures in the profile.
counts_profileVisits: Number of times the profile has been viewed.
counts_kisses: Number of 'kisses' or 'likes' received by the profile.
flirtInterests_*: What the individual is interested in; '*' represents 'chat', 'date', or 'friends'.
verified: Whether the profile has been verified.
lang_count: Number of languages spoken by the individual.
lang_*: Language spoken by the individual; '*' represents 'en' (English), 'de' (German), 'fr' (French), 'it' (Italian), or 'es' (Spanish).
whazzup: A phrase that represents the profile's 'bio'.

The original dataset is already quite usable, but we can produce better models by adding some new variables. The first step is to take a closer look at the language people use in their profiles. I focus on two main things here: the words used in the profile descriptions, and the use of emojis. Both could give insights into a person's confidence and desirability.

The code chunk below creates two new dummy variables, has_emoji and contains_popular_word. has_emoji takes the value '1' when whazzup contains an emoji; contains_popular_word takes the value '1' when whazzup contains a popular word. The chunk also outputs the most popular words in a word cloud. (The word cloud is a dynamic image that shows a word's popularity when hovering over it.)

# Define stop words for different languages
all_stop_words <- c(stopwords::stopwords("de"), stopwords::stopwords("en"), stopwords::stopwords("fr"))

# Define dummy variable that detects presence of emojis
# Also remove digits from 'whazzup' column
df <- df %>%
  mutate(has_emoji = ifelse(emoji_detect(whazzup), 1, 0),
         whazzup = str_remove_all(whazzup, "[[:digit:]]+"))

# Get the most used words in profile
# First, create a table of words with the corresponding counts_profileVisits
words_visits <- df %>%
  unnest_tokens(word, whazzup) %>%
  select(word, counts_profileVisits)

# Then calculate the mean counts_profileVisits for each word and its count
words <- words_visits %>%
  group_by(word) %>%
  summarise(mean_profileVisits = mean(counts_profileVisits, na.rm = TRUE),
            word_count = n(), 
            .groups = "drop")

# Create word popularity index and determine popular words
words <- words %>%
  mutate(popularity_index = 0.8 * word_count + 0.2 * mean_profileVisits) %>% 
  filter(popularity_index > 200 & word_count > 10, 
         word != "", 
         !is.na(word), 
         !is.na(mean_profileVisits), 
         !is.na(word_count), 
         !is.na(popularity_index)) %>%
  filter(!word %in% all_stop_words)

# Create a single pattern string that matches any word in words$word
words_pattern <- paste(words$word, collapse = "|")

# Add the new variable to df
df <- df %>%
  mutate(contains_popular_word = ifelse(str_detect(whazzup, words_pattern), 1, 0)) %>%
  mutate(
    contains_popular_word = replace_na(contains_popular_word, 0),
    has_emoji = replace_na(has_emoji, 0)
  )

words <- words %>%
  arrange(desc(popularity_index)) %>% 
  select(c(word, popularity_index))
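One caveat with words_pattern: pasting the words together with '|' matches substrings, so a short popular word can fire inside a longer, unrelated one. A base-R sketch of the issue and a word-boundary variant (the words used here are hypothetical examples, not necessarily members of words$word):

```r
# Substring matching: "amo" also fires inside "amore"
bios <- c("ti amo", "amore mio", "hello")
pattern_plain <- "amo"
grepl(pattern_plain, bios)          # TRUE TRUE FALSE

# Wrapping the alternatives in \b word boundaries matches whole words only
pattern_bound <- paste0("\\b(", paste(c("amo", "hey"), collapse = "|"), ")\\b")
grepl(pattern_bound, bios)          # TRUE FALSE FALSE
```

The same boundary-wrapped pattern could be passed to str_detect() when building contains_popular_word, at the cost of no longer matching inflected forms.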

Part 2: Exploratory Data Analysis

This segment aims to identify underlying patterns and relationships within the dataset. An initial step involves visually inspecting the variables, helping to assess their potential relevance and impact on the outcomes of interest. As a fundamental part of exploratory data analysis, these visual inspections allow one to discern which features could be instrumental in shaping predictive models.

As hinted in the introductory section, it quickly becomes apparent that specific variables have a more pronounced influence on the number of profile ‘Likes’, while others may largely dictate the number of ‘Profile Views’. This distinction is crucial, as certain profile elements only become observable once a profile is viewed. For instance, the information in a profile biography only comes into play during a profile view. Therefore, the dynamics of what draws views and subsequently encourages likes may differ significantly, although both are important aspects of profile engagement.

Interestingly, despite these differences, one notices a robust correlation between profile views and likes. This interplay implies that a successful profile is not just about attracting views but also about converting those views into likes. Figure 2 visually represents this relationship, further illuminating the interdependent nature of profile views and likes. Uncovering these patterns provides essential insights that can inform our subsequent modeling efforts.

Figure 2: Bubble plot of profile views and number of pictures in profile. A non-linear model (loess method) was fitted to the plot to discern possible patterns and differences between bios with emojis and those without. The size of the dots represents how detailed the account is.

Biography characteristics and popularity

Figure 3 below presents whether there is a difference in the distribution of likes received based on the newly created dummy variables has_emoji, contains_popular_word, and night_owl. There seem to be slight differences in likes received, supporting the idea that the use of emojis and certain words may signal higher levels of trust. Being online at night may also increase profile views, but I treat this variable as a control rather than a causal one, since more people tend to be online at night than during the day.
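The night_owl dummy referenced here is derived from profile activity timestamps; its construction is not shown in this document. A minimal base-R sketch of one plausible construction, assuming a last-online timestamp and an illustrative 22:00-06:00 night window:

```r
# Hypothetical sketch: flag users whose last-online hour falls at night.
# The timestamps and the 22:00-06:00 window are illustrative assumptions.
last_online <- as.POSIXct(c("2023-01-01 02:30:00", "2023-01-01 14:00:00",
                            "2023-01-01 23:10:00"), tz = "UTC")
hour <- as.integer(format(last_online, "%H"))
night_owl <- ifelse(hour >= 22 | hour < 6, 1L, 0L)   # 1 0 1
```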

Figure 3: Boxplots showing effects of profile characteristics on popularity. Left panel: effect of a bio containing social media particulars and/or an emoji on likes received. Right panel: effect of an online profile and/or being a night owl on number of profile visits.

Geographical characteristics and popularity

Utilising the Google Maps API, I successfully geocoded the locations of all profiles present in the dataset. The primary objective behind this was to explore and visualise the potential impact of geographical location on profile views. The role of location might be significant, considering how geographical and cultural aspects can influence user interactions and preferences on the platform. The code snippet below shows the process to perform the geocoding operation.

# Note: commented out, due to costs associated with geocoding through the API
# df_city <- df %>% 
#   select(c(city, country, counts_profileVisits)) %>% 
#   mutate(address = paste0(city, ", ", country)) %>%
#   group_by(address) %>%
#   summarise(mean_profile_views = mean(counts_profileVisits, na.rm = TRUE))
# 
# df_city <- df_city %>%
#   mutate(geocode_data = map(address, ~geocode(.x, source = "google", output = "latlon")),
#          lon = map_dbl(geocode_data, "lon"),
#          lat = map_dbl(geocode_data, "lat"))
# 
# write_csv(df_city, "geocode_latlon.csv")
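Because the geocoding chunk is commented out, later runs would re-load the cached results from geocode_latlon.csv rather than query the API again. A minimal base-R sketch of that round trip, using a temporary file and toy coordinates in place of the real cache:

```r
# Write a toy stand-in for "geocode_latlon.csv" and read it back
cache <- tempfile(fileext = ".csv")
toy <- data.frame(address = c("Zurich, CH", "Berlin, DE"),
                  mean_profile_views = c(812.5, 640.2),
                  lon = c(8.5417, 13.4050),
                  lat = c(47.3769, 52.5200))
write.csv(toy, cache, row.names = FALSE)
df_city <- read.csv(cache)   # in the real workflow: read_csv("geocode_latlon.csv")
```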

The subsequent bubble plot illustrates a few disparities among cities. However, these contrasts are not significant enough to confirm any clear geographical trends in profile views. Thus, it is not feasible to definitively say that some regions show more inclination towards profile views than others based on this representation.

To gain a more insightful understanding, a choropleth map is utilised. This geographical representation not only gives a visual interpretation of the data but also enhances comprehension through colour-coding. Upon implementing this, it becomes clear that certain countries indeed experience higher profile views on average.

In particular, profiles originating from Spain, Hungary, and the Netherlands tend to attract more attention than those from other European countries. The reasons behind these trends may be numerous: cultural nuances, user behaviours, or the presence of more active users in these regions. Future investigation might delve deeper into these aspects to provide more concrete explanations for the observed patterns.

Figure 4: Geographic data visualisation of profile views. The size and colour of the bubbles in the top panel indicate profile views. The colour of each country in the bottom panel indicates profile views.

When visualising the data on a map, we noticed that profiles from certain countries tend to get more views. But there is more to the story than geography.

I produced a lollipop chart (Figure 5 below) showing the number of users in each region, with the colour of the lollipop indicating mean profile views. What we see is interesting: a country's apparent popularity does not necessarily match up with its number of users. This discrepancy can be chalked up to 'sample size bias'. Simply put, countries with fewer users naturally show a higher mean number of views, because a few very popular individuals push up the average.

As it turns out, using a profile's country of origin to predict its popularity might be misleading. To make the final model as accurate as possible, this variable was left out of the mix.
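A toy base-R illustration of this sample-size bias: a handful of users with one viral profile can out-average a large, steady user base, even though the typical user there is no more popular (all numbers are made up):

```r
# Small country: 3 users, one viral outlier
small <- c(5000, 12, 8)
# Large country: 300 users with steady view counts
large <- rep(150, 300)

mean(small)     # ~1673: looks far "more popular" on average
mean(large)     # 150
median(small)   # 12: the typical user is not popular at all
```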

Figure 5: Lollipop chart of the number of users by country. The colour of the lollipops indicates mean profile views.

Other profile characteristics and popularity

In this sub-section, the objective is to ascertain the impact of various profile attributes on the degree of popularity experienced on the dating application. The attributes under scrutiny span an array of factors, including the number of pictures a profile has, its verification status, whether it can be shared, and the expressed interests of the profile owner, among others.

In order to illuminate the relationships between these variables and the response - profile likes - a correlogram has been produced, which reveals some notable insights. For instance, a slight negative correlation is observed between profile views and factors such as age, interests leaning towards ‘just friends’, and shareability of the profile. On the contrary, having a verified status and showcasing multilingual abilities are positively correlated with profile likes, signifying their potential influence in enhancing a profile’s appeal.

Figure 6: Correlogram of profile characteristics and number of likes received.

The significance of language as a determinant of popularity was also explored in this analysis. This was reflected in Figure 6, where the number of languages spoken was considered as a potential predictor of popularity. Subsequently, Figure 7 provides a visualisation of the distribution of received likes in relation to specific languages spoken by the profiles.

Despite these considerations, the investigation does not reveal a discernible difference in the distribution of profile likes contingent on the languages spoken. The absence of any substantial differentiation in this context suggested that the language factor may not hold significant sway over profile popularity. Consequently, the language variable was not included in the formulation of the final predictive models.

Figure 7: Ridgeline plot of languages spoken and number of likes received. Dashed line shows overall mean profile likes.

The perceived attractiveness of a profile is often regarded as a significant determinant of mate searching behaviour. However, the dataset at hand does not include any direct measures of perceived attractiveness. Nevertheless, we have access to a proxy for this attribute, namely the number of pictures present in a profile. While it may not be the most accurate representation of attractiveness, it offers some insight into the visual appeal of a profile.

In conjunction with this, the presence of social media tags on a profile was also examined, given that these tags may serve as additional indicators of social validation or popularity.

Upon examining Figure 8, we observe a correlation between the number of pictures in a profile and the number of likes received. Specifically, among profiles with few pictures, those without social media tags tend to receive fewer likes than those featuring such tags. As the number of pictures increases, the distinction between profiles with and without social media tags becomes less apparent.

This implies that while social media tags can enhance the visibility of a profile, their impact diminishes as the number of pictures increases. Thus, the number of pictures in a profile, serving as a rudimentary indicator of attractiveness, can also influence the popularity of a profile to some degree.
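The relationship in Figure 8 is computed as the mean number of likes per picture count. A minimal base-R sketch of that aggregation on toy data (the column names follow the dataset; the values are made up):

```r
# Toy data: likes received at each picture count
toy <- data.frame(counts_pictures = c(1, 1, 3, 3, 3, 5),
                  counts_kisses   = c(10, 20, 30, 50, 40, 100))

# Mean likes per number of pictures, as plotted in Figure 8
mean_likes <- aggregate(counts_kisses ~ counts_pictures, data = toy, FUN = mean)
mean_likes$counts_kisses   # 15 40 100
```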

Figure 8: Dotted line plot of the number of pictures in profile and likes received. The mean number of likes received by number of photos was used to plot this relationship. Lines split based on social media tag presence in profile.

Part 3: Modelling & Results

After data preparation and exploration, we can proceed with modelling. As part of the modelling approach for this analysis, I have elected to implement two popular machine learning techniques: decision trees and random forests. These methods were chosen for their interpretability, effectiveness in handling complex datasets, and capacity for both classification and regression tasks.

In each of these chosen techniques, two separate models were trained to serve distinct predictive purposes. The first model targets the prediction of profile views, while the second model aims at forecasting profile likes. This dual-model approach was adopted in recognition of the distinct factors that could potentially influence these two different measures of user engagement. Each model is trained on a different set of predictor variables, carefully chosen based on the insights gathered during the data exploration phase.

The first step is partitioning the data into training and testing subsets. For this analysis, I adopt the widely used 70/30 split, whereby 70% of the data forms the training set and the remaining 30% is reserved for testing. This allocation strikes a balance: ample data to train the model effectively, whilst retaining a substantial portion for assessing the model's performance on unseen data. The code below shows how this split was executed. The process was undertaken twice, resulting in two distinct sets, one for profile views and another for profile likes, enabling a targeted examination of each aspect of profile engagement.

# Set seed for reproducibility
set.seed(777)

# Define training and testing sets for profile visits prediction
split_visits <- initial_split(df, prop = 0.7, strata = "Profile_Views")
training_visits <- training(split_visits)
testing_visits <- testing(split_visits)
# Set seed for reproducibility
set.seed(777)

# Define training and testing sets for likes prediction
split_kisses <- initial_split(df, prop = 0.7, strata = "Profile_Likes")
training_kisses <- training(split_kisses)
testing_kisses <- testing(split_kisses)

Decision Tree Model

Decision trees are a type of predictive modelling approach, so called because they produce a tree-like model of decisions. In this instance, two decision tree models were constructed: one for profile visits and another for profile likes.

For the profile visits model, three predictor variables were considered: ‘isOnline’ (whether the user is currently online), ‘night_owl’ (whether the user is active during nighttime hours), and ‘age’.

The profile likes model, on the other hand, was slightly more complex, considering a wider array of variables including ‘has_emoji’, ‘has_social’, ‘Profile_Views’, ‘counts_pictures’, ‘lang_count’, ‘flirtInterests_chat’, ‘flirtInterests_date’, ‘flirtInterests_friends’, and ‘counts_details’. These variables were deemed to be potentially relevant to the number of likes a profile receives.

Once each decision tree model was trained, the 'summary' function was invoked to provide a comprehensive view of each model's characteristics. The output includes details such as variable importance, split points, and node summaries, providing valuable insight into the models' decision-making process.

# Train decision tree model for profile visits
visits_tree <- rpart(formula = Profile_Views ~ isOnline + night_owl + age,
                     data = training_visits, method = "class")
summary(visits_tree)
## Call:
## rpart(formula = Profile_Views ~ isOnline + night_owl + age, data = training_visits, 
##     method = "class")
##   n= 2780 
## 
##           CP nsplit rel error    xerror       xstd
## 1 0.09165067      0 1.0000000 1.0518234 0.01033211
## 2 0.01000000      1 0.9083493 0.9083493 0.01179280
## 
## Variable importance
##  isOnline night_owl       age 
##        76        14         9 
## 
## Node number 1: 2780 observations,    complexity param=0.09165067
##   predicted class=Low   expected loss=0.7496403  P(node) =1
##     class counts:   696   695   694   695
##    probabilities: 0.250 0.250 0.250 0.250 
##   left son=2 (1632 obs) right son=3 (1148 obs)
##   Primary splits:
##       isOnline  < 0.5  to the right, improve=28.021100, (0 missing)
##       age       < 20.5 to the right, improve=13.984950, (0 missing)
##       night_owl < 0.5  to the left,  improve= 4.592948, (0 missing)
##   Surrogate splits:
##       night_owl < 0.5  to the left,  agree=0.665, adj=0.188, (0 split)
##       age       < 20.5 to the right, agree=0.637, adj=0.121, (0 split)
## 
## Node number 2: 1632 observations
##   predicted class=Low   expected loss=0.6917892  P(node) =0.5870504
##     class counts:   503   426   392   311
##    probabilities: 0.308 0.261 0.240 0.191 
## 
## Node number 3: 1148 observations
##   predicted class=High  expected loss=0.6655052  P(node) =0.4129496
##     class counts:   193   269   302   384
##    probabilities: 0.168 0.234 0.263 0.334
# Train decision tree model for profile likes
kisses_tree <- rpart(formula = Profile_Likes ~ has_emoji + has_social + Profile_Views + counts_pictures + lang_count + flirtInterests_chat + flirtInterests_date + flirtInterests_friends + counts_details,
                     data = training_kisses, method = "class")
summary(kisses_tree)
## Call:
## rpart(formula = Profile_Likes ~ has_emoji + has_social + Profile_Views + 
##     counts_pictures + lang_count + flirtInterests_chat + flirtInterests_date + 
##     flirtInterests_friends + counts_details, data = training_kisses, 
##     method = "class")
##   n= 2780 
## 
##          CP nsplit rel error    xerror       xstd
## 1 0.3291445      0 1.0000000 1.0227163 0.01086573
## 2 0.1570807      1 0.6708555 0.6708555 0.01274182
## 3 0.1556307      2 0.5137748 0.5553407 0.01254886
## 4 0.0100000      3 0.3581440 0.3581440 0.01126769
## 
## Variable importance
##          Profile_Views        counts_pictures         counts_details 
##                     62                     17                      9 
##              has_emoji             lang_count             has_social 
##                      5                      3                      2 
## flirtInterests_friends    flirtInterests_chat 
##                      1                      1 
## 
## Node number 1: 2780 observations,    complexity param=0.3291445
##   predicted class=Low       expected loss=0.7442446  P(node) =1
##     class counts:   711   686   689   694
##    probabilities: 0.256 0.247 0.248 0.250 
##   left son=2 (1407 obs) right son=3 (1373 obs)
##   Primary splits:
##       Profile_Views   splits as  LLRR,     improve=458.58640, (0 missing)
##       counts_pictures < 4.5  to the left,  improve=103.51720, (0 missing)
##       counts_details  < 0.02 to the left,  improve= 43.10391, (0 missing)
##       has_emoji       < 0.5  to the left,  improve= 21.17671, (0 missing)
##       has_social      < 0.5  to the left,  improve= 20.01437, (0 missing)
##   Surrogate splits:
##       counts_pictures < 4.5  to the left,  agree=0.674, adj=0.339, (0 split)
##       counts_details  < 0.48 to the left,  agree=0.595, adj=0.180, (0 split)
##       has_emoji       < 0.5  to the left,  agree=0.565, adj=0.119, (0 split)
##       has_social      < 0.5  to the left,  agree=0.537, adj=0.063, (0 split)
##       lang_count      < 1.5  to the left,  agree=0.534, adj=0.056, (0 split)
## 
## Node number 2: 1407 observations,    complexity param=0.1570807
##   predicted class=Low       expected loss=0.5010661  P(node) =0.5061151
##     class counts:   702   551   150     4
##    probabilities: 0.499 0.392 0.107 0.003 
##   left son=4 (694 obs) right son=5 (713 obs)
##   Primary splits:
##       Profile_Views   splits as  LR--,     improve=249.320100, (0 missing)
##       counts_pictures < 1.5  to the left,  improve= 32.865990, (0 missing)
##       counts_details  < 0.02 to the left,  improve= 14.853900, (0 missing)
##       lang_count      < 3.5  to the right, improve=  5.032545, (0 missing)
##       has_emoji       < 0.5  to the left,  improve=  3.548000, (0 missing)
##   Surrogate splits:
##       counts_pictures        < 2.5  to the left,  agree=0.625, adj=0.239, (0 split)
##       counts_details         < 0.06 to the left,  agree=0.570, adj=0.128, (0 split)
##       flirtInterests_friends < 0.5  to the left,  agree=0.546, adj=0.079, (0 split)
##       has_emoji              < 0.5  to the left,  agree=0.526, adj=0.039, (0 split)
##       flirtInterests_chat    < 0.5  to the left,  agree=0.519, adj=0.024, (0 split)
## 
## Node number 3: 1373 observations,    complexity param=0.1556307
##   predicted class=High      expected loss=0.4974508  P(node) =0.4938849
##     class counts:     9   135   539   690
##    probabilities: 0.007 0.098 0.393 0.503 
##   left son=6 (675 obs) right son=7 (698 obs)
##   Primary splits:
##       Profile_Views   splits as  --LR,     improve=249.294100, (0 missing)
##       counts_pictures < 9.5  to the left,  improve= 21.748450, (0 missing)
##       lang_count      < 1.5  to the left,  improve=  8.590223, (0 missing)
##       has_social      < 0.5  to the left,  improve=  7.759364, (0 missing)
##       counts_details  < 0.98 to the left,  improve=  4.147261, (0 missing)
##   Surrogate splits:
##       counts_pictures     < 6.5  to the left,  agree=0.614, adj=0.215, (0 split)
##       has_emoji           < 0.5  to the left,  agree=0.547, adj=0.079, (0 split)
##       counts_details      < 0.67 to the left,  agree=0.543, adj=0.071, (0 split)
##       lang_count          < 1.5  to the left,  agree=0.536, adj=0.056, (0 split)
##       flirtInterests_chat < 0.5  to the left,  agree=0.532, adj=0.047, (0 split)
## 
## Node number 4: 694 observations
##   predicted class=Low       expected loss=0.1613833  P(node) =0.2496403
##     class counts:   582   106     6     0
##    probabilities: 0.839 0.153 0.009 0.000 
## 
## Node number 5: 713 observations
##   predicted class=Low Mid   expected loss=0.3758766  P(node) =0.2564748
##     class counts:   120   445   144     4
##    probabilities: 0.168 0.624 0.202 0.006 
## 
## Node number 6: 675 observations
##   predicted class=High Mid  expected loss=0.3659259  P(node) =0.2428058
##     class counts:     7   134   428   106
##    probabilities: 0.010 0.199 0.634 0.157 
## 
## Node number 7: 698 observations
##   predicted class=High      expected loss=0.1633238  P(node) =0.2510791
##     class counts:     2     1   111   584
##    probabilities: 0.003 0.001 0.159 0.837

The decision tree model for Profile Views was trained on 2780 observations, using isOnline, night_owl, and age as predictor variables. The variable isOnline was deemed the most important, contributing the bulk of the total reduction in node impurity, followed by night_owl and age. A single split was made (reflected in the nsplit value of the CP table), which decreased the relative error. The primary split criterion was isOnline, with age and night_owl as candidate alternatives. Supplementary division rules, termed surrogate splits, were also established.

Conversely, the decision tree model for Profile Likes was developed with a wider array of predictor variables: has_emoji, has_social, Profile_Views, counts_pictures, lang_count, flirtInterests_chat, flirtInterests_date, flirtInterests_friends, and counts_details. This model revealed Profile_Views as the most significant variable, with counts_pictures and counts_details next in line. The model made three splits (nsplit rises from 0 to 3 in the CP table), each progressively reducing the relative error. As with the first model, primary and surrogate splits were established, with the former centred on Profile_Views.

Each node in these decision tree models divulges key predictive details. To illustrate, Node 2 of the Profile Views tree houses 1632 observations. The predicted category here is ‘Low’, with the anticipated misclassification rate (expected loss) around 0.69. The node also provides a distribution of the target variable categories in terms of probabilities. This step is consistently applied across all nodes and both decision tree models.
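The expected loss reported for each node is simply the share of the node's observations that fall outside the predicted class. Reproducing the figure for Node 2 of the Profile Views tree from its class counts:

```r
# Node 2 class counts from the summary output; predicted class is "Low"
counts <- c(Low = 503, `Low Mid` = 426, `High Mid` = 392, High = 311)

# Expected loss = proportion of the node NOT in the predicted class
expected_loss <- unname(1 - counts["Low"] / sum(counts))
round(expected_loss, 7)   # 0.6917892, matching the summary output
```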

The following code chunk applies the model to the test set:

# Predict on testing data
predict_visits <- predict(visits_tree, newdata = testing_visits, type = "class")
predict_kisses <- predict(kisses_tree, newdata = testing_kisses, type = "class")

# Confusion matrices for evaluation
confusionMatrix(predict_visits, testing_visits$Profile_Views)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Low Low Mid High Mid High
##   Low      221     184      159  139
##   Low Mid    0       0        0    0
##   High Mid   0       0        0    0
##   High      78     114      139  159
## 
## Overall Statistics
##                                           
##                Accuracy : 0.3185          
##                  95% CI : (0.2921, 0.3458)
##     No Information Rate : 0.2506          
##     P-Value [Acc > NIR] : 0.00000007922   
##                                           
##                   Kappa : 0.091           
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Low Class: Low Mid Class: High Mid Class: High
## Sensitivity              0.7391         0.0000          0.0000      0.5336
## Specificity              0.4609         1.0000          1.0000      0.6302
## Pos Pred Value           0.3144            NaN             NaN      0.3245
## Neg Pred Value           0.8408         0.7502          0.7502      0.8023
## Prevalence               0.2506         0.2498          0.2498      0.2498
## Detection Rate           0.1852         0.0000          0.0000      0.1333
## Detection Prevalence     0.5893         0.0000          0.0000      0.4107
## Balanced Accuracy        0.6000         0.5000          0.5000      0.5819
confusionMatrix(predict_kisses, testing_kisses$Profile_Likes)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Low Low Mid High Mid High
##   Low      253      48        0    0
##   Low Mid   48     175       57    0
##   High Mid   4      70      186   57
##   High       0       1       53  241
## 
## Overall Statistics
##                                                
##                Accuracy : 0.7167               
##                  95% CI : (0.6902, 0.7421)     
##     No Information Rate : 0.2557               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.6222               
##                                                
##  Mcnemar's Test P-Value : NA                   
## 
## Statistics by Class:
## 
##                      Class: Low Class: Low Mid Class: High Mid Class: High
## Sensitivity              0.8295         0.5952          0.6284      0.8087
## Specificity              0.9459         0.8832          0.8540      0.9397
## Pos Pred Value           0.8405         0.6250          0.5868      0.8169
## Neg Pred Value           0.9417         0.8697          0.8744      0.9365
## Prevalence               0.2557         0.2464          0.2481      0.2498
## Detection Rate           0.2121         0.1467          0.1559      0.2020
## Detection Prevalence     0.2523         0.2347          0.2657      0.2473
## Balanced Accuracy        0.8877         0.7392          0.7412      0.8742

The output presents the confusion matrices and statistics for the two decision tree models’ performance on the testing data.

For the first model, which predicts four categories (Low, Low Mid, High Mid, and High), every observation was assigned to either Low or High; the Low Mid and High Mid classes were never predicted. As a result, the model’s accuracy is low at 31.85%, with a 95% confidence interval between 29.21% and 34.58%. The kappa statistic is 0.091, indicating poor agreement between the model’s predictions and the actual categories.

For each class, we can observe the following:

  1. Class Low: The model has a sensitivity or true positive rate of 73.91%, meaning it correctly identified 73.91% of the Low class instances. However, its positive predictive value (the proportion of true positives in the predicted positives) is just 31.44%. It indicates a high false positive rate. The model has a balanced accuracy of 60% for this class, which accounts for both sensitivity and specificity and is an overall measure of its performance.

  2. The model never predicted Low Mid or High Mid, which explains the zero values in Sensitivity and Detection Rate and the NaN values in Pos Pred Value for these classes.

  3. Class High: The model has a sensitivity of 53.36% and a positive predictive value of 32.45%, indicating that the model struggles to accurately identify and predict High class instances. The balanced accuracy is 58.19% for this class.
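These per-class figures follow directly from the printed confusion matrix. For Class Low of the first model (rows are predictions, columns the reference):

```r
# Confusion matrix for the Profile Views tree (rows = predicted, cols = reference)
cm <- matrix(c(221, 184, 159, 139,
                 0,   0,   0,   0,
                 0,   0,   0,   0,
                78, 114, 139, 159),
             nrow = 4, byrow = TRUE,
             dimnames = list(pred = c("Low", "Low Mid", "High Mid", "High"),
                             ref  = c("Low", "Low Mid", "High Mid", "High")))

TP <- cm["Low", "Low"]
sensitivity <- TP / sum(cm[, "Low"])   # TP / all actual Low    -> 0.7391
ppv         <- TP / sum(cm["Low", ])   # TP / all predicted Low -> 0.3144

TN <- sum(cm[-1, -1])                  # neither predicted nor actual Low
specificity <- TN / sum(cm[, -1])      # TN / all actual non-Low -> 0.4609
balanced    <- (sensitivity + specificity) / 2   # -> 0.6000
```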

In the second model, the overall accuracy improves substantially to 71.67%, with a 95% confidence interval between 69.02% and 74.21%. The kappa statistic is 0.6222, indicating substantial agreement between the model’s predictions and the actual categories.
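Cohen’s kappa can be verified by hand from the second model’s confusion matrix: it compares the observed accuracy with the agreement expected by chance given the row and column totals.

```r
# Confusion matrix for the Profile Likes tree (rows = predicted, cols = reference)
cm <- matrix(c(253,  48,   0,   0,
                48, 175,  57,   0,
                 4,  70, 186,  57,
                 0,   1,  53, 241),
             nrow = 4, byrow = TRUE)

n  <- sum(cm)                                 # 1193 test observations
po <- sum(diag(cm)) / n                       # observed accuracy, ~0.7167
pe <- sum(rowSums(cm) * colSums(cm)) / n^2    # chance agreement, ~0.25
kappa <- (po - pe) / (1 - pe)                 # ~0.6222, matching the output
```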

For each class, we can observe the following:

  1. Class Low: The model has a high sensitivity of 82.95% and a positive predictive value of 84.05%. The balanced accuracy is 88.77% for this class, suggesting a good performance in identifying and predicting Low class instances.

  2. Class Low Mid: The model has a moderate sensitivity of 59.52% and a positive predictive value of 62.50%. The balanced accuracy for this class is 73.92%.

  3. Class High Mid: The model’s performance decreases for this class, with a sensitivity of 62.84% and a positive predictive value of 58.68%. The balanced accuracy for this class is 74.12%.

  4. Class High: The model performs well with this class, with a sensitivity of 80.87% and a positive predictive value of 81.69%. The balanced accuracy is 87.42% for this class.

The second model clearly outperforms the first in predicting the test data, with substantially higher accuracy and far stronger agreement between predictions and actual categories. However, there is room for improvement, particularly in predicting the Low Mid and High Mid classes.

Figure 9 below visualises the decision trees.

Figure 9: Results of decision tree models.

Random Forest Model

# Set seed for reproducibility
set.seed(777)

# Filter out NA values
training_visits <- training_visits %>% filter(!is.na(night_owl))

# Random Forest model for profile visits
visits_rf <- randomForest(formula = Profile_Views ~ isOnline + night_owl + age + genderLooking,
                          data = training_visits,
                          importance = TRUE, 
                          ntree = 500)

# View model summary
visits_rf
## 
## Call:
##  randomForest(formula = Profile_Views ~ isOnline + night_owl +      age + genderLooking, data = training_visits, importance = TRUE,      ntree = 500) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 69.17%
## Confusion matrix:
##          Low Low Mid High Mid High class.error
## Low      439      18       38  201   0.3692529
## Low Mid  368      13       24  290   0.9812950
## High Mid 338      12       22  322   0.9682997
## High     254      15       43  383   0.4489209
# Random Forest model for profile likes
kisses_rf <- randomForest(formula = Profile_Likes ~ has_emoji + has_social + Profile_Views + counts_pictures + lang_count + flirtInterests_chat + flirtInterests_date + flirtInterests_friends + counts_details,
                          data = training_kisses,
                          importance = TRUE, 
                          ntree = 500)

# View model summary
kisses_rf
## 
## Call:
##  randomForest(formula = Profile_Likes ~ has_emoji + has_social +      Profile_Views + counts_pictures + lang_count + flirtInterests_chat +      flirtInterests_date + flirtInterests_friends + counts_details,      data = training_kisses, importance = TRUE, ntree = 500) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 27.45%
## Confusion matrix:
##          Low Low Mid High Mid High class.error
## Low      587     116        6    2   0.1744023
## Low Mid  118     422      143    3   0.3848397
## High Mid   9     139      430  111   0.3759071
## High       0       9      107  578   0.1671470
# Predict on test data
visits_rf_pred <- predict(visits_rf, newdata = testing_visits)
kisses_rf_pred <- predict(kisses_rf, newdata = testing_kisses)

# Confusion matrices
confusionMatrix(visits_rf_pred, testing_visits$Profile_Views)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Low Low Mid High Mid High
##   Low      181     154      121  125
##   Low Mid   12       5       14   14
##   High Mid  13      19       18   14
##   High      92     120      145  145
## 
## Overall Statistics
##                                                
##                Accuracy : 0.2928               
##                  95% CI : (0.2671, 0.3195)     
##     No Information Rate : 0.25                 
##     P-Value [Acc > NIR] : 0.0004424            
##                                                
##                   Kappa : 0.057                
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
## 
## Statistics by Class:
## 
##                      Class: Low Class: Low Mid Class: High Mid Class: High
## Sensitivity              0.6074       0.016779         0.06040      0.4866
## Specificity              0.5526       0.955257         0.94855      0.6007
## Pos Pred Value           0.3115       0.111111         0.28125      0.2888
## Neg Pred Value           0.8085       0.744551         0.75177      0.7783
## Prevalence               0.2500       0.250000         0.25000      0.2500
## Detection Rate           0.1518       0.004195         0.01510      0.1216
## Detection Prevalence     0.4874       0.037752         0.05369      0.4211
## Balanced Accuracy        0.5800       0.486018         0.50447      0.5436
confusionMatrix(kisses_rf_pred, testing_kisses$Profile_Likes)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Low Low Mid High Mid High
##   Low      253      48        1    0
##   Low Mid   46     165       55    7
##   High Mid   6      80      186   49
##   High       0       1       54  242
## 
## Overall Statistics
##                                                
##                Accuracy : 0.7091               
##                  95% CI : (0.6825, 0.7348)     
##     No Information Rate : 0.2557               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.6122               
##                                                
##  Mcnemar's Test P-Value : NA                   
## 
## Statistics by Class:
## 
##                      Class: Low Class: Low Mid Class: High Mid Class: High
## Sensitivity              0.8295         0.5612          0.6284      0.8121
## Specificity              0.9448         0.8799          0.8495      0.9385
## Pos Pred Value           0.8377         0.6044          0.5794      0.8148
## Neg Pred Value           0.9416         0.8598          0.8739      0.9375
## Prevalence               0.2557         0.2464          0.2481      0.2498
## Detection Rate           0.2121         0.1383          0.1559      0.2028
## Detection Prevalence     0.2531         0.2288          0.2691      0.2490
## Balanced Accuracy        0.8872         0.7205          0.7389      0.8753

The outcomes obtained from the two separate random forest models may be interpreted as follows:

Profile Views Random Forest Model

This model aimed to predict ‘Profile_Views’ utilizing the features: ‘isOnline’, ‘night_owl’, ‘age’, and ‘genderLooking’. The training process involved 500 decision trees, with each split in the tree considering 2 variables.

An Out-of-Bag (OOB) error estimate, a standard internal measure of random forest accuracy, was computed to be 69.17%. This means the model misclassified roughly 69.17% of the out-of-bag samples.

An examination of the confusion matrix reveals varying rates of accuracy across the different classes. For example, the model exhibits the most accurate predictions for the ‘Low’ category, as evidenced by a class error rate of approximately 36.92%. Conversely, the ‘Low Mid’ and ‘High Mid’ categories showed substantial misclassification, reflected by the exceedingly high class error rates close to 98%.
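Both the OOB error rate and the per-class errors are plain ratios on the OOB confusion matrix, which can be checked against the Profile Views output above:

```r
# OOB confusion matrix for the Profile Views forest (rows = actual class)
cm <- matrix(c(439,  18,  38, 201,
               368,  13,  24, 290,
               338,  12,  22, 322,
               254,  15,  43, 383),
             nrow = 4, byrow = TRUE,
             dimnames = list(c("Low", "Low Mid", "High Mid", "High"), NULL))

class_error <- 1 - diag(cm) / rowSums(cm)   # Low: 257/696 ~ 0.3693, etc.
oob_error   <- 1 - sum(diag(cm)) / sum(cm)  # 1923/2780 ~ 0.6917
```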

Profile Likes Random Forest Model

The second model sought to predict ‘Profile_Likes’ based on a range of features including ‘has_emoji’, ‘has_social’, ‘Profile_Views’, ‘counts_pictures’, ‘lang_count’, ‘flirtInterests_chat’, ‘flirtInterests_date’, ‘flirtInterests_friends’, and ‘counts_details’. Similar to the first model, this one was also trained using 500 trees. However, at each split, this model considered 3 variables.

The OOB error rate for the second model is substantially lower at 27.45%, suggesting a better fit to the data as compared to the first model.

Upon analyzing the confusion matrix, it can be observed that the model demonstrated reasonable accuracy for the ‘Low’ and ‘High’ classes, with class error rates of 17.44% and 16.71% respectively. Nonetheless, the model encountered challenges with the ‘Low Mid’ and ‘High Mid’ categories, where the class error rates were 38.48% and 37.59% respectively.

Results from Testing Data

The profile views model achieves an overall accuracy of only 29.28%, barely above the 25% no-information rate. The sensitivity, or true positive rate, varies considerably across classes, with the highest rate (60.74%) observed for the ‘Low’ category and the lowest rate (1.68%) for the ‘Low Mid’ category. The specificity, or true negative rate, also varies, ranging from 95.53% for the ‘Low Mid’ category to 55.26% for the ‘Low’ category. These variations suggest differential model performance across classes.

The profile likes outcome reveals a more satisfactory accuracy rate of 70.91%. In this case, both sensitivity and specificity are more evenly distributed across the classes, implying more consistent model performance.

In conclusion, the analysis suggests that the second model is more accurate and robust in making predictions compared to the first. It is also important to note that both models show varying performance levels when applied to different classes, which could be due to distinct characteristics within each class that the models capture with varying degrees of success.

Figure 10 below shows the importance of each variable in terms of predictive power for the profile likes random forest model.

Figure 10: Results of random forest model visualised.

Discussion & Conclusion

The decision tree model demonstrates a level of efficacy; however, it doesn’t fully capture the intricate relationships within the data. Its primary splits rely almost exclusively on one predictor, Profile_Views. While Profile_Views may be a critical factor, leaning so heavily on it while underusing the other variables potentially diminishes the model’s performance.

In comparing the decision tree and random forest models, several key points emerge that offer insights into their relative strengths and weaknesses in this particular application.

Model Complexity and Understanding

The decision tree model has the advantage of being relatively simple to understand and interpret. Each decision within the tree corresponds to a question about one of the variables, making it a model that’s easy to visualize and explain. However, this simplicity can also be a limitation as it may not capture complex interactions among variables. This may explain its less-than-satisfactory performance on certain metrics, like sensitivity and specificity across various classes, and overall accuracy.

On the other hand, the random forest model, which operates by creating a multitude of decision trees and aggregating their results, is capable of capturing more complex patterns and interactions in the data. However, the trade-off is that it’s more challenging to interpret, as it essentially involves a multitude of decision processes rather than just one.

Performance

The decision tree model’s overall performance was modest at best, especially when compared to the random forest models. Both random forest models demonstrated significantly better performance in terms of overall accuracy and class-specific metrics such as sensitivity and specificity. It is worth noting, however, that even the random forest models had substantial differences in performance, likely due to the different variables included in each model and the number of variables tried at each split.

The decision tree’s performance was notably weak when trying to predict certain classes (‘Low Mid’ and ‘High Mid’), indicating that it struggled with differentiating among these classes. This suggests that a single decision tree might not have enough flexibility to capture the nuances of this particular dataset.

Robustness

Random forest models are known to be less prone to overfitting compared to decision tree models. This is because they average the results of many different trees, each of which is trained on a slightly different subset of the data. This difference in robustness is likely a contributing factor to the better performance of the random forest models on the test data.
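This averaging idea can be sketched in a few lines with rpart and the built-in iris data (a stand-in dataset; note this is plain bagging, whereas randomForest additionally samples a subset of variables at each split):

```r
library(rpart)
set.seed(42)

# Hold out a test set
idx   <- sample(nrow(iris), 100)
train <- iris[idx, ]
test  <- iris[-idx, ]

# Grow 50 trees, each on a bootstrap resample of the training data
preds <- sapply(1:50, function(i) {
  boot <- train[sample(nrow(train), replace = TRUE), ]
  fit  <- rpart(Species ~ ., data = boot, method = "class")
  as.character(predict(fit, test, type = "class"))
})

# Majority vote across trees: the bagged prediction
vote <- apply(preds, 1, function(p) names(which.max(table(p))))
mean(vote == test$Species)   # bagged accuracy on the held-out rows
```

Because each tree sees a different resample, its individual errors partly cancel out in the vote, which is the source of the robustness discussed above.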

Computational Complexity

From a computational perspective, the decision tree model is less resource-intensive, making it a more suitable choice for datasets with a large number of variables or instances. Random forests, however, can require significant computational resources, especially as the number of trees increases.

In conclusion, while the decision tree model might be more easily interpretable and computationally efficient, it was significantly outperformed by the random forest models in this specific scenario. This indicates that the random forest, with its ability to capture complex interactions and its robustness to overfitting, was better suited to this dataset. It’s a reminder that there’s always a trade-off between interpretability and predictive performance, and the best model depends on the specific context and the requirements of the analysis.